Introduction

subdiagnosis <- readr::read_tsv(
  file.path("..", "..", "..", "data", "current", "SCPCP000006", "single_cell_metadata.tsv"),
  show_col_types = FALSE
) |>
  dplyr::filter(scpca_sample_id == params$sample_id) |>
  dplyr::pull(subdiagnosis)

This notebook explores using CopyKAT to estimate tumor and normal cells in SCPCS000208 from SCPCP000006. This sample has a(n) Anaplastic subdiagnosis.

CopyKAT was run using the copyKAT.R script with and without a normal reference, using either an euclidean or statistical (spearman) method to calculate distance in copyKAT.

These results are read into this notebook and used to:

  • Visualize diploid and aneuploid cells on the UMAP.
  • Evaluate common copy number gains and losses in Wilms tumor.
  • Calculate the confusion matrix comparing manual annotations of tumor cells to using CopyKAT to annotate tumor cells.
  • Compare the annotations from CopyKAT to cell type annotations using label transfer and the fetal (kidney) references.

Packages

library(Seurat)
library(SCpubr)             # for plotting 
library(tidyverse)
library(patchwork)

set.seed(params$seed)

Base directories

# The base path for the OpenScPCA repository, found by its (hidden) .git directory
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current", "SCPCP000006")

# The path to this module
module_base <- file.path(repository_base, "analyses", "cell-type-wilms-tumor-06")

Input files

In this notebook, we are working with the Wilms tumor sample defined in SCPCS000208 from the Wilms tumor dataset “SCPCP000006”. We work with the pre-processed and labeled Seurat object that is the output of 02b_label-transfer_fetal_kidney_reference_Stewart.Rmd saved in the results directory.

result_dir <- file.path(module_base, "results", params$sample_id)
predictions_paths <- list()
full_ck_result_paths <- list()
for(ref_value in c("ref", "noref")){
  for(distance_value in c("euclidean", "spearman")){
    predictions_file <- glue::glue("05_copykat_", {params$sample_id}, "_",ref_value,"_distance-", distance_value, "_copykat_prediction.txt")
    full_ck_result_file <- glue::glue("05_copykat_", {params$sample_id}, "_",ref_value,"_distance-", distance_value, "_copykat_CNA_results.txt")

    predictions_paths[[glue::glue(ref_value, "_", distance_value)]] <- file.path(result_dir, "05_copyKAT", ref_value, distance_value, predictions_file)
    full_ck_result_paths[[glue::glue(ref_value, "_", distance_value)]] <- file.path(result_dir, "05_copyKAT", ref_value, distance_value, full_ck_result_file)
  }
}

Output file

Reports will be saved in the notebook directory. The pre-processed and annotated Seurat object per samples are saved in the result folder.

Functions

Here we defined function that will be used multiple time all along the notebook.

Analysis

Load the pre-processed Seurat object

# open the processed rds object
srat <- readRDS(file.path(result_dir, paste0("02b-fetal_kidney_label-transfer_", params$sample_id,".Rds")))
DefaultAssay(srat) <- "SCT"

CopyKAT results

Below we look at the heatmaps produced by CopyKAT.

Heatmap without reference

Euclidean distance

Spearman distance

Heatmap with endothelial cells as reference

Euclidean distance

Spearman distance

UMAP

Below we prepare and plot a UMAP that shows which cells are classified as diploid, aneuploid, and not defined by CopyKAT. We show a side by side UMAP with results from running CopyKAT both with and without a reference of normal cells.

# read in ck predictions from both reference types (no_normal and with_normal)
ck_results_df <- predictions_paths |> 
  purrr::map(readr::read_tsv) |> 
  dplyr::bind_rows(.id = "reference_used")

# get umap coordinate
umap_df <- srat[["umap"]]@cell.embeddings |>
  as.data.frame() |>
  rownames_to_column('barcodes')

cnv_df <- umap_df |>
  dplyr::left_join(ck_results_df, by = c("barcodes" = "cell.names")) 
ggplot(cnv_df, aes(x = umap_1, y = umap_2, color = copykat.pred)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() +
  facet_wrap(vars(reference_used))

Validate common CNAs found in Wilms tumor

To validate some of these annotations, we can also look at some commonly found copy number variations in Wilms tumor patients:

  • Loss of Chr1p
  • Gain of Chr1q
  • Loss of Chr11p13
  • Loss of Chr11p15
  • Loss of Chr16q

Although these are the most frequent, there are patients who do not have any of these alterations and patients that only have some of these alterations. See Tirode et al., and Crompton et al..

CopyKAT outputs a matrix that contains the estimated copy numbers for each gene in each cell. We can read that in and look at the mean estimated copy numbers for each chromosome across each cell. We might expect that tumor cells would show an increased estimated copy number in Chr1q, and/or a loss of Chr1p, Chr11p and Chr16q.

# read in full gene by cell copy number detection results
full_ck_results_df <- full_ck_result_paths |>
    purrr::map(readr::read_tsv) |>
    dplyr::bind_rows(.id = "reference_used")
                                                                                                        
# for every cell, calculate the mean detection level across all genes in a given chromosome
full_cnv_df <- full_ck_results_df |>
    tidyr::pivot_longer(
    cols = -c(
    reference_used,
    chrom
    ),
    names_to = "barcodes",
    values_to = "cnv_detection"
    ) |>
    dplyr::group_by(chrom, barcodes, reference_used) |>
    dplyr::summarise(mean_cnv_detection = mean(cnv_detection))
                                                                                                        
# join with cnv info
cnv_df <- cnv_df |>
   dplyr::left_join(full_cnv_df, by = c("barcodes", "reference_used")) |>
   dplyr::filter(!is.na(chrom))    

Let’s look at the distribution of CNV estimation in cells that are called aneuploid and diploid by CopyKAT.

# create faceted density plots showing estimation of CNV detection across each chr of interest
# colored by aneuploid/diploid estimation
ggplot(cnv_df, aes(x = mean_cnv_detection, color = copykat.pred)) +
  geom_density() +
  theme_bw() +
  facet_grid(
    rows = vars(chrom),
    cols = vars(reference_used)
  )

Conclusions

From the heatmap of CNV and the mean CNV detection plots, it is difficult to see any CNV pattern that drive the identification of aneuploid cells. The assignment of the aneuploidy/diploidy value might relies on very few CNV and/or an arbitrary threshold. This might be why the assignement of aneuploidy/diploidy values differs between condition (and between runs!!).

Session Info

# record the versions of the packages used in this analysis and other environment information
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
##  [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Vienna
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.2.0    lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2        readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1     
## [11] tidyverse_2.0.0    SCpubr_2.0.2       Seurat_5.1.0       SeuratObject_5.0.2 sp_2.1-4          
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.16.0       jsonlite_1.8.8          magrittr_2.0.3          spatstat.utils_3.1-0    farver_2.1.2            rmarkdown_2.28          DElegate_1.2.1         
##   [9] zlibbioc_1.50.0         vctrs_0.6.5             ROCR_1.0-11             spatstat.explore_3.3-2  memoise_2.0.1           htmltools_0.5.8.1       sass_0.4.9              sctransform_0.4.1      
##  [17] parallelly_1.38.0       KernSmooth_2.23-24      bslib_0.8.0             htmlwidgets_1.6.4       ica_1.0-3               plyr_1.8.9              plotly_4.10.4           zoo_1.8-12             
##  [25] cachem_1.1.0            igraph_2.0.3            mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3         Matrix_1.7-0            R6_2.5.1                fastmap_1.2.0          
##  [33] GenomeInfoDbData_1.2.12 fitdistrplus_1.2-1      future_1.34.0           shiny_1.9.1             digest_0.6.37           colorspace_2.1-1        AnnotationDbi_1.66.0    S4Vectors_0.42.1       
##  [41] tensor_1.5              rprojroot_2.0.4         RSpectra_0.16-2         irlba_2.3.5.1           RSQLite_2.3.7           labeling_0.4.3          progressr_0.14.0        timechange_0.3.0       
##  [49] spatstat.sparse_3.1-0   fansi_1.0.6             polyclip_1.10-7         abind_1.4-5             httr_1.4.7              compiler_4.4.1          bit64_4.0.5             withr_3.0.1            
##  [57] DBI_1.2.3               fastDummies_1.7.4       highr_0.11              MASS_7.3-61             tools_4.4.1             lmtest_0.9-40           httpuv_1.6.15           future.apply_1.11.2    
##  [65] goftest_1.2-3           glue_1.7.0              nlme_3.1-166            promises_1.3.0          grid_4.4.1              Rtsne_0.17              cluster_2.1.6           reshape2_1.4.4         
##  [73] generics_0.1.3          spatstat.data_3.1-2     gtable_0.3.5            tzdb_0.4.0              data.table_1.16.0       hms_1.1.3               utf8_1.2.4              XVector_0.44.0         
##  [81] spatstat.geom_3.3-2     BiocGenerics_0.50.0     RcppAnnoy_0.0.22        ggrepel_0.9.5           RANN_2.6.2              pillar_1.9.0            spam_2.10-0             RcppHNSW_0.6.0         
##  [89] vroom_1.6.5             later_1.3.2             splines_4.4.1           lattice_0.22-6          deldir_2.0-4            survival_3.7-0          bit_4.0.5               tidyselect_1.2.1       
##  [97] Biostrings_2.72.1       miniUI_0.1.1.1          pbapply_1.7-2           knitr_1.48              gridExtra_2.3           IRanges_2.38.1          scattermore_1.2         stats4_4.4.1           
## [105] xfun_0.47               Biobase_2.64.0          matrixStats_1.3.0       DT_0.33                 stringi_1.8.4           UCSC.utils_1.0.0        lazyeval_0.2.2          yaml_2.3.10            
## [113] evaluate_0.24.0         codetools_0.2-20        cli_3.6.3               uwot_0.2.2              xtable_1.8-4            reticulate_1.38.0       munsell_0.5.1           jquerylib_0.1.4        
## [121] Rcpp_1.0.13             GenomeInfoDb_1.40.1     spatstat.random_3.3-1   globals_0.16.3          png_0.1-8               spatstat.univar_3.0-0   parallel_4.4.1          blob_1.2.4             
## [129] dotCall64_1.1-1         listenv_0.9.1           viridisLite_0.4.2       scales_1.3.0            ggridges_0.5.6          leiden_0.4.3.1          crayon_1.5.3            rlang_1.1.4            
## [137] cowplot_1.1.3           KEGGREST_1.44.1